School of Computer and Information, Hefei University of Technology, China
Abstract:Material stress analysis is a critical aspect of material design and performance optimization. Under dynamic loading, the global stress evolution in materials exhibits complex spatiotemporal characteristics, especially in two-phase random materials (TRMs). Such kind of material failure is often associated with stress concentration, and the phase boundaries are key locations where stress concentration occurs. In practical engineering applications, the spatiotemporal resolution of acquired microstructural data and its dynamic stress evolution is often limited. This poses challenges for deep learning methods in generating high-resolution spatiotemporal stress fields, particularly for accurately capturing stress concentration regions. In this study, we propose a framework for global stress generation and spatiotemporal super-resolution in TRMs under dynamic loading. First, we introduce a diffusion model-based approach, named as Spatiotemporal Stress Diffusion (STS-diffusion), for generating global spatiotemporal stress data. This framework incorporates Space-Time U-Net (STU-net), and we systematically investigate the impact of different attention positions on model accuracy. Next, we develop a physics-informed network for spatiotemporal super-resolution, termed as Spatiotemporal Super-Resolution Physics-Informed Operator (ST-SRPINN). The proposed ST-SRPINN is an unsupervised learning method. The influence of data-driven and physics-informed loss function weights on model accuracy is explored in detail. Benefiting from physics-based constraints, ST-SRPINN requires only low-resolution stress field data during training and can upscale the spatiotemporal resolution of stress fields to arbitrary magnifications.
Abstract:Stress analysis is an important part of material design. For materials with complex microstructures, such as two-phase random materials (TRMs), material failure is often accompanied by stress concentration. Phase interfaces in two-phase materials are critical for stress concentration. Therefore, the prediction error of stress at phase boundaries is crucial. In practical engineering, the pixels of the obtained material microstructure images are limited, which limits the resolution of stress images generated by deep learning methods, making it difficult to observe stress concentration regions. Existing Image Super-Resolution (ISR) technologies are all based on data-driven supervised learning. However, stress images have natural physical constraints, which provide new ideas for new ISR technologies. In this study, we constructed a stress prediction framework for TRMs. First, the framework uses a proposed Multiple Compositions U-net (MC U-net) to predict stress in low-resolution material microstructures. By considering the phase interface information of the microstructure, the MC U-net effectively reduces the problem of excessive prediction errors at phase boundaries. Secondly, a Mixed Physics-Informed Neural Network (MPINN) based method for stress ISR (SRPINN) was proposed. By introducing the constraints of physical information, the new method does not require paired stress images for training and can increase the resolution of stress images to any multiple. This enables a multiscale analysis of the stress concentration regions at phase boundaries. Finally, we performed stress analysis on TRMs with different phase volume fractions and loading states through transfer learning. The results show the proposed stress prediction framework has satisfactory accuracy and generalization ability.
Abstract:Recent years have witnessed significant progress in reinforcement learning, especially with Zero-like paradigms, which have greatly boosted the generalization and reasoning abilities of large-scale language models. Nevertheless, existing frameworks are often plagued by high implementation complexity and poor reproducibility. To tackle these challenges, we present AlphaZero-Edu, a lightweight, education-focused implementation built upon the mathematical framework of AlphaZero. It boasts a modular architecture that disentangles key components, enabling transparent visualization of the algorithmic processes. Additionally, it is optimized for resource-efficient training on a single NVIDIA RTX 3090 GPU and features highly parallelized self-play data generation, achieving a 3.2-fold speedup with 8 processes. In Gomoku matches, the framework has demonstrated exceptional performance, achieving a consistently high win rate against human opponents. AlphaZero-Edu has been open-sourced at https://github.com/StarLight1212/AlphaZero_Edu, providing an accessible and practical benchmark for both academic research and industrial applications.
Abstract:While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions between three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs that limit real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on top-k layers of an LLM for speech token prediction while freezing the bottom-k layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.
Abstract:Self-supervised learning has become a core technique in speech processing, but the high dimensionality of its representations makes discretization essential for improving efficiency. However, existing discretization methods still suffer from significant information loss, resulting in a notable performance gap compared to continuous representations. To overcome these limitations, we propose two quantization-based discretization methods: Product Quantization (PQ) and Random Product Quantization (RPQ). PQ partitions the original feature space into multiple subspaces and independently quantizes each sub-vector, producing a fused set of discrete units that retain diverse information from different subspaces, thus mitigating the loss associated with single-cluster quantization. RPQ further enhances representation diversity by randomly sampling a fixed proportion of feature dimensions multiple times to construct sub-vectors, thereby better capturing the variability in the data distribution. Theoretical analysis shows that RPQ reduces the correlation coefficient rho (where 0 <= rho <= 1) between sub-quantizers. Its quantization error is lower-bounded by the product of rho and epsilon-kms, where epsilon-kms denotes the quantization error of a single K-means quantizer. Experimental results on a combined dataset built from LibriSpeech and ML-SUPERB show that PQ and RPQ outperform standard K-means discretization, achieving relative improvements of 21.8 percent and 20.0 percent in WER on LibriSpeech, and 24.1 percent and 19.6 percent in CER on ML-SUPERB, respectively. Moreover, their performance is competitive with, and in some cases even surpasses, that of continuous SSL representations.
Abstract:Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a 2-steps diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15X speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.
Abstract:Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first largescale dataset for HOI Generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using the powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates the hallucination by individual MLLM. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation. Project webpage is available at https://liuqi-creat.github.io/HOIGen.github.io.
Abstract:Generative models have recently gained attention in recommendation systems by directly predicting item identifiers from user interaction sequences. However, existing methods suffer from significant information loss due to the separation of stages such as quantization and sequence modeling, hindering their ability to achieve the modeling precision and accuracy of sequential dense retrieval techniques. Integrating generative and dense retrieval methods remains a critical challenge. To address this, we introduce the Cascaded Organized Bi-Represented generAtive retrieval (COBRA) framework, which innovatively integrates sparse semantic IDs and dense vectors through a cascading process. Our method alternates between generating these representations by first generating sparse IDs, which serve as conditions to aid in the generation of dense vectors. End-to-end training enables dynamic refinement of dense representations, capturing both semantic insights and collaborative signals from user-item interactions. During inference, COBRA employs a coarse-to-fine strategy, starting with sparse ID generation and refining them into dense vectors via the generative model. We further propose BeamFusion, an innovative approach combining beam search with nearest neighbor scores to enhance inference flexibility and recommendation diversity. Extensive experiments on public datasets and offline tests validate our method's robustness. Online A/B tests on a real-world advertising platform with over 200 million daily users demonstrate substantial improvements in key metrics, highlighting COBRA's practical advantages.
Abstract:The objective of image super-resolution is to reconstruct a high-resolution (HR) image with the prior knowledge from one or several low-resolution (LR) images. However, in the real world, due to the limited complementary information, the performance of both single-frame and multi-frame super-resolution reconstruction degrades rapidly as the magnification increases. In this paper, we propose a novel two-step image super resolution method concatenating multi-frame super-resolution (MFSR) with single-frame super-resolution (SFSR), to progressively upsample images to the desired resolution. The proposed method consisting of an L0-norm constrained reconstruction scheme and an enhanced residual back-projection network, integrating the flexibility of the variational modelbased method and the feature learning capacity of the deep learning-based method. To verify the effectiveness of the proposed algorithm, extensive experiments with both simulated and real world sequences were implemented. The experimental results show that the proposed method yields superior performance in both objective and perceptual quality measurements. The average PSNRs of the cascade model in set5 and set14 are 33.413 dB and 29.658 dB respectively, which are 0.76 dB and 0.621 dB more than the baseline method. In addition, the experiment indicates that this cascade model can be robustly applied to different SFSR and MFSR methods.
Abstract:Neural-networks-driven intelligent data-plane (NN-driven IDP) is becoming an emerging topic for excellent accuracy and high performance. Meanwhile we argue that NN-driven IDP should satisfy three design goals: the flexibility to support various NNs models, the low-latency-high-throughput inference performance, and the data-plane-unawareness harming no performance and functionality. Unfortunately, existing work either over-modify NNs for IDP, or insert inline pipelined accelerators into the data-plane, failing to meet the flexibility and unawareness goals. In this paper, we propose Kaleidoscope, a flexible and high-performance co-processor located at the bypass of the data-plane. To address the challenge of meeting three design goals, three key techniques are presented. The programmable run-to-completion accelerators are developed for flexible inference. To further improve performance, we design a scalable inference engine which completes low-latency and low-cost inference for the mouse flows, and perform complex NNs with high-accuracy for the elephant flows. Finally, raw-bytes-based NNs are introduced, which help to achieve unawareness. We prototype Kaleidoscope on both FPGA and ASIC library. In evaluation on six NNs models, Kaleidoscope reaches 256-352 ns inference latency and 100 Gbps throughput with negligible influence on the data-plane. The on-board tested NNs perform state-of-the-art accuracy among other NN-driven IDP, exhibiting the the significant impact of flexibility on enhancing traffic analysis accuracy.